|
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives. Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses, an 85–15 form of the Pareto principle (). More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set (). ==Algorithm description== An ''empty Bloom filter'' is a bit array of bits, all set to 0. There must also be different hash functions defined, each of which maps or hashes some set element to one of the array positions with a uniform random distribution. To ''add'' an element, feed it to each of the hash functions to get array positions. Set the bits at all these positions to 1. To ''query'' for an element (test whether it is in the set), feed it to each of the hash functions to get array positions. If any of the bits at these positions is 0, the element is definitely not in the set if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem. The requirement of designing different independent hash functions can be prohibitive for large . For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass different initial values (such as 0, 1, ..., − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger and/or , independence among the hash functions can be relaxed with negligible increase in false positive rate (, ). Specifically, show the effectiveness of deriving the indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values. Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. An element maps to bits, and although setting any one of those bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits for an element to be removed, clearing any of the bits would introduce the possibility for false negatives. One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter. It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event. 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Bloom filter」の詳細全文を読む スポンサード リンク
|